Exploring IPython


This notebook explores what IPython offers for data analysis.

Hello, Python

The IPython notebook is an application to build interactive computational notebooks.

Notebooks are composed of many "cells", which can contain text (like this one), or code (like the one below).


In [1]:
x = [1, 2, 3, 4, 5]
for item in x:
    print "Item is ", item


Item is  1
Item is  2
Item is  3
Item is  4
Item is  5

Python Libraries

I will be using several different libraries throughout this notebook.


In [8]:
#IPython is what is running the notebook
import IPython
print "IPython version:      %6.6s (need at least 1.0)" % IPython.__version__

# Numpy is a library for working with Arrays
import numpy as np
print "Numpy version:        %6.6s (need at least 1.7.1)" % np.__version__

# SciPy implements many different numerical algorithms
import scipy as sp
print "SciPy version:        %6.6s (need at least 0.12.0)" % sp.__version__

# Pandas makes working with data tables easier
import pandas as pd
print "Pandas version:       %6.6s (need at least 0.11.0)" % pd.__version__

# Module for plotting
import matplotlib
print "Matplotlib version:   %6.6s (need at least 1.2.1)" % matplotlib.__version__

# SciKit Learn implements several Machine Learning algorithms
import sklearn
print "Scikit-Learn version: %6.6s (need at least 0.13.1)" % sklearn.__version__

# Requests is a library for getting data from the Web
import requests
print "requests version:     %6.6s (need at least 1.2.3)" % requests.__version__

# Networkx is a library for working with networks
import networkx as nx
print "NetworkX version:     %6.6s (need at least 1.7)" % nx.__version__


IPython version:       2.2.0 (need at least 1.0)
Numpy version:         1.9.0 (need at least 1.7.1)
SciPy version:        0.14.0 (need at least 0.12.0)
Pandas version:       0.14.1 (need at least 0.11.0)
Matplotlib version:    1.4.0 (need at least 1.2.1)
Scikit-Learn version: 0.15.2 (need at least 0.13.1)
requests version:      2.4.1 (need at least 1.2.3)
NetworkX version:      1.9.1 (need at least 1.7)

Hello matplotlib

The notebook integrates nicely with Matplotlib, the primary plotting package for Python. The cell below should embed a figure of a sine wave:


In [9]:
#this line prepares IPython for working with matplotlib
%matplotlib inline  

# this actually imports matplotlib
import matplotlib.pyplot as plt  

x = np.linspace(0, 10, 30)  #array of 30 points from 0 to 10
y = np.sin(x)
z = y + np.random.normal(size=30) * .2
plt.plot(x, y, 'ro-', label='A sine wave')
plt.plot(x, z, 'b-', label='Noisy sine')
plt.legend(loc = 'lower right')
plt.xlabel("X axis")
plt.ylabel("Y axis")


Out[9]:
<matplotlib.text.Text at 0xab8a0fac>

Hello Numpy

The Numpy array processing library is the basis of nearly all numerical computing in Python.


In [10]:
print "Make a 3 row x 4 column array of random numbers"
x = np.random.random((3, 4))
print x
print

print "Add 1 to every element"
x = x + 1
print x
print

print "Get the element at row 1, column 2"
print x[1, 2]
print

# The colon syntax is called "slicing" the array. 
print "Get the first row"
print x[0, :]
print

print "Get every 2nd column of the first row"
print x[0, ::2]
print


Make a 3 row x 4 column array of random numbers
[[ 0.18677756  0.62051019  0.60317618  0.18590372]
 [ 0.58447024  0.89845987  0.84403132  0.76375834]
 [ 0.45951986  0.60146715  0.70093769  0.93143209]]

Add 1 to every element
[[ 1.18677756  1.62051019  1.60317618  1.18590372]
 [ 1.58447024  1.89845987  1.84403132  1.76375834]
 [ 1.45951986  1.60146715  1.70093769  1.93143209]]

Get the element at row 1, column 2
1.84403131511

Get the first row
[ 1.18677756  1.62051019  1.60317618  1.18590372]

Get every 2nd column of the first row
[ 1.18677756  1.60317618]

Print the maximum, minimum, and mean of the array. This does not require writing a loop.


In [12]:
print "Max is  ", x.max()
print "Min is  ", x.min()
print "Mean is ", x.mean()


Max is   1.93143208882
Min is   1.18590372255
Mean is  1.61503701695

Call x.max again, but use the axis keyword to print the maximum of each row in x.


In [13]:
print x.max(axis=1)


[ 1.62051019  1.89845987  1.93143209]
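
To make the axis keyword concrete, here is a small sketch on a fixed array (the array a is made up for illustration and is not the notebook's random data). Passing axis=1 collapses the columns and gives one value per row; passing axis=0 collapses the rows and gives one value per column.

```python
import numpy as np

# Hypothetical 2 x 3 array chosen so the results are easy to check by eye
a = np.array([[1, 5, 2],
              [7, 3, 4]])

row_max = a.max(axis=1)   # one maximum per row
col_max = a.max(axis=0)   # one maximum per column

print(row_max)  # [5 7]
print(col_max)  # [7 5 4]
```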

Simulate 500 "fair" coin tosses (where the probability of getting Heads is 50%, or 0.5):


In [14]:
x = np.random.binomial(500, .5)
print "number of heads:", x


number of heads: 253

Repeat this simulation 500 times and plot a histogram of the number of Heads (1s) in each simulation:


In [15]:
# 3 ways to run the simulations

# loop
heads = []
for i in range(500):
    heads.append(np.random.binomial(500, .5))

# "list comprehension"
heads = [np.random.binomial(500, .5) for i in range(500)]

# pure numpy
heads = np.random.binomial(500, .5, size=500)

histogram = plt.hist(heads, bins=10)
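
As a rough sanity check on the histogram, the number of heads in n fair tosses has mean n * p and standard deviation sqrt(n * p * (1 - p)), so the bars should cluster around 250 with a spread of about 11. The sketch below compares those values with one simulated sample. The fixed seed and the use of np.random.RandomState are assumptions made only to keep the example repeatable.

```python
import numpy as np

n, p = 500, 0.5

# Theoretical mean and standard deviation of a binomial count
expected_mean = n * p                     # 250.0
expected_std = np.sqrt(n * p * (1 - p))   # about 11.18

# Seeded generator (an assumption, used only for repeatability)
rng = np.random.RandomState(0)
heads = rng.binomial(n, p, size=500)

print(expected_mean)
print(expected_std)
print(heads.mean())   # should land near expected_mean
print(heads.std())    # should land near expected_std
```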


The Monty Hall Problem

In a gameshow, contestants try to guess which of 3 closed doors contains a cash prize (goats are behind the other two doors). Of course, the odds of choosing the correct door are 1 in 3. As a twist, the host of the show occasionally opens a door after a contestant makes his or her choice. This door is always one of the two the contestant did not pick, and is also always one of the goat doors (note that it is always possible to do this, since there are two goat doors). At this point, the contestant has the option of keeping his or her original choice, or switching to the other unopened door. The question is: is there any benefit to switching doors?

We can answer the problem by running simulations in Python. We'll do it in several parts.

First, we will write a function called simulate_prizedoor. This function will simulate the location of the prize in many games -- see the detailed specification below:


In [16]:
"""
Function
--------
simulate_prizedoor

Generate a random array of 0s, 1s, and 2s, representing
hiding a prize between door 0, door 1, and door 2

Parameters
----------
nsim : int
    The number of simulations to run

Returns
-------
sims : array
    Random array of 0s, 1s, and 2s

Example
-------
>>> print simulate_prizedoor(3)
array([0, 0, 2])
"""
def simulate_prizedoor(nsim):
    #draw nsim integers uniformly from {0, 1, 2}
    return np.random.randint(0, 3, nsim)
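
A quick way to check simulate_prizedoor is to count how often each door holds the prize; with enough simulations, each of the three counts should be close to nsim / 3. This is only a sketch: the function is repeated so the block runs on its own, and np.bincount is one of several ways to do the tally.

```python
import numpy as np

def simulate_prizedoor(nsim):
    # Same definition as above, repeated so this block is self-contained
    return np.random.randint(0, 3, nsim)

sims = simulate_prizedoor(9000)

# Tally how many times each door (0, 1, 2) holds the prize
counts = np.bincount(sims, minlength=3)
print(counts)
```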

Next, we will write a function that simulates the contestant's guesses for nsim simulations. We'll call this function simulate_guess. The specs:


In [17]:
"""
Function
--------
simulate_guess

Return any strategy for guessing which door a prize is behind. This
could be a random strategy, one that always guesses 2, whatever.

Parameters
----------
nsim : int
    The number of simulations to generate guesses for

Returns
-------
guesses : array
    An array of guesses. Each guess is a 0, 1, or 2

Example
-------
>>> print simulate_guess(5)
array([0, 0, 0, 0, 0])
"""
def simulate_guess(nsim):
    #always guess door 0; the builtin int is used as the dtype
    #because the np.int alias was removed in NumPy 1.24
    return np.zeros(nsim, dtype=int)
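
The specification allows any guessing strategy, and the version above always picks door 0. For comparison, a random-guess variant might look like the sketch below. The name simulate_guess_random is hypothetical and is not part of the original notebook. Because the prize is placed uniformly at random, either strategy should win about a third of the time before any switching.

```python
import numpy as np

def simulate_guess_random(nsim):
    # Hypothetical alternative: pick door 0, 1, or 2 uniformly at random
    return np.random.randint(0, 3, nsim)

guesses = simulate_guess_random(5)
print(guesses)
```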

Next, we will write a function, goat_door, to simulate randomly revealing one of the goat doors that a contestant didn't pick.


In [18]:
"""
Function
--------
goat_door

Simulate the opening of a "goat door" that doesn't contain the prize,
and is different from the contestant's guess

Parameters
----------
prizedoors : array
    The door that the prize is behind in each simulation
guesses : array
    The door that the contestant guessed in each simulation

Returns
-------
goats : array
    The goat door that is opened for each simulation. Each item is 0, 1, or 2, and is different
    from both prizedoors and guesses

Examples
--------
>>> print goat_door(np.array([0, 1, 2]), np.array([1, 1, 1]))
array([2, 2, 0])
"""

def goat_door(prizedoors, guesses):
    
    #strategy: generate random answers, and
    #keep updating until they satisfy the rule
    #that they aren't a prizedoor or a guess
    result = np.random.randint(0, 3, prizedoors.size)
    while True:
        bad = (result == prizedoors) | (result == guesses)
        if not bad.any():
            return result
        result[bad] = np.random.randint(0, 3, bad.sum())
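
Because goat_door relies on repeated random draws, it is worth confirming that its output really obeys the rules. The following sketch repeats the function so it runs on its own, then checks, for randomly generated prize doors and guesses, that no opened door matches either array. The sample size of 1000 is arbitrary.

```python
import numpy as np

def goat_door(prizedoors, guesses):
    # Same approach as above: redraw any entries that collide with
    # the prize door or the contestant's guess until none remain
    result = np.random.randint(0, 3, prizedoors.size)
    while True:
        bad = (result == prizedoors) | (result == guesses)
        if not bad.any():
            return result
        result[bad] = np.random.randint(0, 3, bad.sum())

prizedoors = np.random.randint(0, 3, 1000)
guesses = np.random.randint(0, 3, 1000)
goats = goat_door(prizedoors, guesses)

# Every opened door must differ from both the prize door and the guess
print((goats != prizedoors).all())  # True
print((goats != guesses).all())     # True
```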

Now, we will write a function, switch_guess, that represents the strategy of always switching a guess after the goat door is opened.


In [19]:
"""
Function
--------
switch_guess

The strategy that always switches a guess after the goat door is opened

Parameters
----------
guesses : array
     Array of original guesses, for each simulation
goatdoors : array
     Array of revealed goat doors for each simulation

Returns
-------
The new door after switching. Should be different from both guesses and goatdoors

Examples
--------
>>> print switch_guess(np.array([0, 1, 2]), np.array([1, 2, 1]))
array([2, 0, 0])
"""
def switch_guess(guesses, goatdoors):
    #integer result, so the returned doors have the same type as the inputs
    result = np.zeros(guesses.size, dtype=int)
    #maps (guess, goat door) to the one remaining door
    switch = {(0, 1): 2, (0, 2): 1, (1, 0): 2, (1, 2): 0, (2, 0): 1, (2, 1): 0}
    for i in [0, 1, 2]:
        for j in [0, 1, 2]:
            mask = (guesses == i) & (goatdoors == j)
            if not mask.any():
                continue
            result = np.where(mask, switch[(i, j)], result)
    return result
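
Since the doors are numbered 0, 1, and 2, the three labels always sum to 3. When the guess and the goat door are different, the remaining door is therefore 3 - guess - goat. The sketch below uses that identity as a compact alternative to the lookup table. The name switch_guess_arith is hypothetical, introduced here for illustration.

```python
import numpy as np

def switch_guess_arith(guesses, goatdoors):
    # The door labels 0 + 1 + 2 sum to 3, so subtracting the two known
    # doors leaves the third. Assumes guesses != goatdoors elementwise.
    return 3 - guesses - goatdoors

switched = switch_guess_arith(np.array([0, 1, 2]), np.array([1, 2, 1]))
print(switched)  # [2 0 0]
```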

Last function: we will write a win_percentage function that takes an array of guesses and an array of prizedoors, and returns the percentage of correct guesses.


In [20]:
"""
Function
--------
win_percentage

Calculate the percent of times that a simulation of guesses is correct

Parameters
-----------
guesses : array
    Guesses for each simulation
prizedoors : array
    Location of prize for each simulation

Returns
--------
percentage : number between 0 and 100
    The win percentage

Examples
---------
>>> print win_percentage(np.array([0, 1, 2]), np.array([0, 0, 0]))
33.333
"""

def win_percentage(guesses, prizedoors):
    return 100 * (guesses == prizedoors).mean()
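
The one-line body works because comparing two arrays with == yields a boolean array, and the mean of a boolean array is the fraction of True entries. A small sketch of the intermediate steps, using the arrays from the docstring example:

```python
import numpy as np

guesses = np.array([0, 1, 2])
prizedoors = np.array([0, 0, 0])

correct = guesses == prizedoors   # elementwise comparison
print(correct)                    # [ True False False]

fraction = correct.mean()         # True counts as 1, False as 0
percentage = 100 * fraction
print(round(percentage, 3))       # 33.333
```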

Now, let's put it all together. Simulate 10000 games where the contestant keeps the original guess, and 10000 games where the contestant switches doors after a goat door is revealed. Compute the percentage of games the contestant wins under each strategy. Is one strategy better than the other?


In [21]:
nsim = 10000

#keep guesses
print "Win percentage when keeping original door"
print win_percentage(simulate_guess(nsim), simulate_prizedoor(nsim))

#switch
#(named prizes rather than pd, so the pandas alias imported earlier is not overwritten)
prizes = simulate_prizedoor(nsim)
guess = simulate_guess(nsim)
goats = goat_door(prizes, guess)
guess = switch_guess(guess, goats)
print "Win percentage when switching doors"
print win_percentage(guess, prizes)


Win percentage when keeping original door
33.2
Win percentage when switching doors
66.14
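
These simulated percentages are close to the exact values of 1/3 and 2/3. One way to see where those come from is to enumerate every equally likely combination of prize door and first guess. Keeping wins only when the first guess was right. Switching wins in every other case, because the host has removed the only other goat door. The sketch below counts both outcomes; it illustrates that argument and is not part of the original notebook.

```python
from fractions import Fraction
from itertools import product

doors = [0, 1, 2]
keep_wins = 0
switch_wins = 0
total = 0

# Each (prize, guess) pair is equally likely: 9 combinations in total
for prize, guess in product(doors, doors):
    total += 1
    if guess == prize:
        # Keeping wins; switching moves away from the prize
        keep_wins += 1
    else:
        # The host must open the one remaining goat door, so the
        # only door left to switch to is the prize door
        switch_wins += 1

print(Fraction(keep_wins, total))    # 1/3
print(Fraction(switch_wins, total))  # 2/3
```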